Abstract

This paper addresses the construction of a short-vector
(128D) image representation for large-scale image and
particular object retrieval. In particular, the method of
joint dimensionality reduction of multiple vocabularies is
considered. We study a variety of vocabulary generation
techniques: different k-means initializations, different
descriptor transformations, and different measurement regions
for descriptor extraction. Our extensive evaluation shows
that different combinations of vocabularies, each partitioning
the descriptor space in a different yet complementary
manner, result in a significant performance improvement,
which exceeds the state-of-the-art.

1. Introduction 

Large-scale image retrieval techniques have been
developing and improving greatly for more than a
decade. Many of the current state-of-the-art approaches [21,
6, 11] are based on the bag-of-words (BOW) approach
originally proposed by Sivic and Zisserman [27]. Another
popular image representation arises from aggregating local
descriptors, as in the Fisher kernel [24] and the Vector of
Locally Aggregated Descriptors (VLAD) [13].

The BOW vectors are high dimensional (up to 64 million
dimensions in [20]), so, due to the high memory and
computational requirements, search is limited to a few
million images on a single machine. More scalable
approaches tackle this problem by generating compact
image representations [28, 24, 13], where the image is
described by a short vector that can be additionally
compressed into compact codes using binarization [28, 30],
product quantization [12], or the recently proposed additive
quantization techniques [3]. In this paper we propose and
experimentally evaluate simple techniques that additionally
boost retrieval performance, but at the same time preserve
low memory and computational costs.

Short vector image representations are often generated
using principal component analysis (PCA) [4] to perform
dimensionality reduction over high-dimensional vectors.
Jegou and Chum [8] study the effects of PCA on BOW
representations. They show that both steps of the PCA
procedure, i.e., centering and selection of a decorrelated
(orthogonal) basis minimizing the dimensionality reduction
error, improve retrieval performance. Centering (mean
subtraction) of BOW vectors provides a boost in
performance by adding a higher value to the negative
evidence: given two BOW vectors, a visual word jointly
missing in both vectors provides useful information for the
similarity measure [8]. Additionally, they advocate joint
dimensionality reduction with multiple vocabularies to
reduce the quantization artifacts underlying BOW and VLAD.
These vocabularies are created by using different
initializations for the k-means algorithm, which may produce
relatively highly correlated vocabularies.

In this paper, we propose to reduce the redundancy of the
joint vocabulary representation (before the joint
dimensionality reduction) by varying parameters of the local
feature descriptors prior to the k-means quantization. In
particular, we propose: (i) different sizes of measurement
regions for local description, (ii) different power-law
normalizations of local feature descriptors, and (iii)
different linear projections (PCA learned) to reduce the
dimensionality of local descriptors. Vocabularies created in
this way are more complementary, so the joint dimensionality
reduction of concatenated BOW vectors originating from
several vocabularies carries more information. Even though
the proposed approaches are simple, we show that they
provide significant boosts to retrieval performance with no
memory or computational overhead at query time.

Related work. This paper can be seen as an extension
of [8], details of which are given later in Section 2.3. A
number of papers report results with short descriptors
obtained by PCA dimensionality reduction. In [14]
and [24], aggregated descriptors (VLAD and Fisher vector,
respectively) are followed by PCA to produce
low dimensional image descriptors. In a paper about
VLAD [2], the authors propose a method for adaptation of
the vocabulary built on an independent dataset (adapt) and
an intra-normalization (innorm) method that $L_2$ normalizes
all VLAD components independently, which suppresses
the burstiness effect [1 ]. In [15], a 'democratic' weighted
aggregation method for burstiness suppression is introduced.
In this paper, we compare results of all the aforementioned
methods using low dimensional descriptors ($D' = 128$).

The rest of the paper is organized as follows: Section 2
gives a brief overview of several methods: bag-of-words
(BOW), efficient PCA dimensionality reduction of high
dimensional vectors, and baseline retrieval with multiple
vocabularies. The datasets and evaluation protocols used are
established in Section 3. Section 4 introduces novel methods
for joint dimensionality reduction of multiple vocabularies
and presents extensive experimental evaluations. Main
conclusions are given in Section 5.

2. Background and baseline 

This section gives a short overview of the background
of bag-of-words based image retrieval and the method used
in [8]. Key steps and ideas are discussed in more detail to
aid understanding of the paper.

2.1. Bag-of-words (BOW) image representation 

The first efficient image retrieval based on the BOW image
representation was proposed by Sivic and Zisserman [27].
They use local descriptors extracted in an image in order
to construct a high-dimensional global descriptor. This
procedure follows four basic steps:

1. For each image in the dataset, regions of interest are
detected [18, 17] and described by an invariant descriptor
which is d-dimensional. In this work we use the
multi-scale Hessian-Affine [23] and MSER [17] detectors,
followed by SIFT [16] or RootSIFT [1] descriptors.
The rotation of the descriptor is either determined
by the detected dominant orientation [16], or
by the gravity vector assumption [2 ]. The descriptors
are extracted from different sizes of measurement
regions [1 ], as described in detail in Section 4.

2. Descriptors extracted from the training (independent)
dataset (see Section 3) are clustered into k clusters
using the k-means algorithm, which creates a visual
vocabulary.

3. For each image in the dataset, a histogram of occurrences
of visual words is computed. Different weighting
schemes can be used, the most popular being inverse
document frequency (idf); this generates a D-dimensional
BOW vector (D = k).

4. All resulting vectors are $L_2$ normalized, as suggested
in [27], producing final global image representations
used for searching.
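
The four steps above can be sketched in a few lines. This is a minimal illustration, assuming that local descriptors and a k-means vocabulary are already available as NumPy arrays; the function names `quantize` and `bow_vector` and the optional `idf` argument are hypothetical and not part of the original pipeline.

```python
import numpy as np

def quantize(descriptors, centroids):
    """Assign each local descriptor (row) to its nearest visual word."""
    # Squared Euclidean distances via ||x||^2 - 2 x.c + ||c||^2.
    d2 = (
        (descriptors ** 2).sum(axis=1, keepdims=True)
        - 2.0 * descriptors @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)

def bow_vector(descriptors, centroids, idf=None):
    """Histogram of visual words, optionally idf-weighted, then L2-normalized."""
    k = centroids.shape[0]
    words = quantize(descriptors, centroids)
    hist = np.bincount(words, minlength=k).astype(np.float64)
    if idf is not None:
        hist *= idf  # per-word weights, assumed to be precomputed
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```

The exhaustive nearest-centroid search used here costs k vector comparisons per descriptor, which is the complexity measure the paper reports for its vocabularies.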

2.2. Efficient PCA of high dimensional vectors 

In most cases BOW image representations have a
very high number of dimensions (D can take values up to
64 million [20]). In these cases the standard PCA method
(reducing $D$ to $D'$), which computes the full covariance
matrix, is not efficient. The dual gram method (see
Paragraph 12.1.4 in [4]) can be used to learn the first $D'$
eigenvectors and eigenvalues. Instead of computing the
$D \times D$ covariance matrix $C$, the dual gram method
computes the $n \times n$ matrix $Y^\top Y$, where $Y$ is the
set of vectors used for learning, and $n$ is the number of
vectors in the set $Y$. Eigenvalue decomposition is performed
using the Arnoldi algorithm, which iteratively computes the
$D'$ desired eigenvectors corresponding to the largest
eigenvalues. This method is more efficient than the standard
covariance matrix method if the number of vectors $n$ of the
training set is smaller than the number of vector dimensions
$D$, which is usually the case in the BOW approach.
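
A minimal sketch of the dual gram idea follows, assuming the training vectors are stored as the columns of a dense NumPy array. For brevity it uses a full symmetric eigendecomposition (`numpy.linalg.eigh`) of the small Gram matrix instead of the Arnoldi iteration mentioned above; the function name `dual_gram_pca` is hypothetical.

```python
import numpy as np

def dual_gram_pca(Y, n_components):
    """Y: (D, n) matrix whose columns are training vectors.

    Returns P (D, n_components) with orthonormal columns and the
    corresponding eigenvalues of the covariance matrix C = Yc Yc^T / n.
    """
    n = Y.shape[1]
    Yc = Y - Y.mean(axis=1, keepdims=True)   # centering
    G = Yc.T @ Yc                            # n x n Gram matrix, cheap when n << D
    mu, U = np.linalg.eigh(G)                # eigenvalues in ascending order
    order = np.argsort(mu)[::-1][:n_components]
    mu, U = mu[order], U[:, order]
    # If G u = mu u, then Yc u is an eigenvector of Yc Yc^T with the same
    # eigenvalue; dividing by sqrt(mu) gives unit length.
    P = (Yc @ U) / np.sqrt(mu)
    lam = mu / n
    return P, lam
```

The sketch assumes that the requested eigenvalues are strictly positive, which requires `n_components` to be smaller than the rank of the centered data (at most n - 1).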

Jegou and Chum [8] analyze the effects of PCA
dimensionality reduction on the BOW and VLAD vectors. They
show that even though PCA successfully deals with the
problem of negative evidence (higher importance of jointly
missing visual words in compared BOW vectors), it ignores
the problem of co-occurrences (co-occurrences lead to
over-counting some visual patterns when comparing two image
vector representations, see [5]). In order to tackle the
aforementioned problem, they propose performing a whitening
operation, similar to the one done in independent component
analysis [7] (implicitly performed by the Mahalanobis
distance), jointly with the PCA. In our experiments we will
use dimensionality reduction from $D$ to $D'$ components, as
done in [8]:

1. Every image vector $v = (v_1, \dots, v_D)$ is
post-processed using power-law normalization [24]:
$v_i := |v_i|^{\beta} \times \mathrm{sign}(v_i)$, with
$0 < \beta < 1$ as a fixed constant.
Vector $v$ is $L_2$ normalized after processing. It has been
shown [14] that this simple procedure reduces the impact
of multiple matches and visual bursts [10]. In all
our experiments $\beta = 0.5$, denoted as signed square
rooting (SSR).

2. The first $D'$ eigenvectors of matrix $C$, corresponding
to the largest $D'$ eigenvalues $\lambda_1, \dots, \lambda_{D'}$,
are learned using power-law normalized training vectors
$Y = [Y_1 | \dots | Y_n]$.

3. Every power-law normalized image descriptor used for
searching $X$ is PCA-projected and truncated, and at
the same time whitened and re-normalized to a new
vector $\hat{X}$ that is the final short vector representation
with dimensionality $D'$:

$$\hat{X} = \frac{\mathrm{diag}(\lambda_1^{-1/2}, \dots, \lambda_{D'}^{-1/2})\, P^\top X}{\left\| \mathrm{diag}(\lambda_1^{-1/2}, \dots, \lambda_{D'}^{-1/2})\, P^\top X \right\|}$$

where the $D \times D'$ matrix $P$ is formed by the eigenvectors
with the largest eigenvalues, calculated in the previous step.
Comparing two vectors after this dimensionality reduction
with the Euclidean distance is now similar to using a
Mahalanobis distance. It has been argued that the
re-normalization step is critical for a better comparison
metric, see [8].
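
Steps 1 and 3 can be illustrated with the short sketch below, assuming that the projection matrix `P` and the eigenvalues `lam` have already been learned (step 2). The helper names and the optional `mean` argument are hypothetical; the formula in the text is written without an explicit mean term, so centering is left optional here.

```python
import numpy as np

def ssr_l2(v, beta=0.5):
    """Power-law normalization |v_i|^beta * sign(v_i), then L2 normalization."""
    v = np.sign(v) * np.abs(v) ** beta
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def reduce_whiten(x, P, lam, mean=None):
    """Project x (D,) with P (D, D'), whiten with lam (D',), and re-normalize."""
    if mean is not None:
        x = x - mean                    # optional centering (assumption)
    y = (P.T @ x) / np.sqrt(lam)        # diag(lam^{-1/2}) P^T x
    norm = np.linalg.norm(y)
    return y / norm if norm > 0 else y
```

Because the output is re-normalized to unit length, ranking by Euclidean distance between two reduced vectors is equivalent to ranking by their dot product.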

In order to compare results in a fair manner, we will use 
D' = 128 dimensions for all our experiments following the 
trend of previous research in short image representations. 

2.3. The baseline method 

This paper builds upon the work of [8], which is briefly
reviewed in this section. In [8], a joint dimensionality
reduction of multiple vocabularies is proposed. Image
representation vectors are separately SSR normalized for each
vocabulary, concatenated and then jointly PCA-reduced and
whitened as explained in Section 2.2. The idf term is
ignored, and it is noted that its influence is limited when
used with multiple vocabularies. Results of this method are
shown in Figure 1 (right plots). Compared to the
straightforward concatenation (Figure 1, left plots), where
the results do not noticeably improve after adding multiple
vocabularies, an improvement in performance is achieved
while keeping low memory requirements by using PCA
dimensionality reduction. However, for some vocabulary sizes
(e.g., k = 2k), performance drops after only a few
vocabularies are used.

3. Datasets and evaluation 

Results of our methods are evaluated on the datasets [25,
26, 9] that are widely used in the image retrieval area. We
also compare our results with other approaches evaluated on
the same datasets.

Oxford5k [25] and Paris6k [26]: Each dataset contains
a set of images (5062 for Oxford and 6300 for Paris)
depicting 11 different landmarks together with distractors,
downloaded from Flickr by searching for tags of popular
landmarks. For each of the 11 landmarks there are 5
different query regions defined by a bounding box, meaning
that there are 55 different query regions per dataset. The
performance is reported as mean average precision (mAP),
see [25] for more details. In our experiments we use Paris6k
as a training dataset in order to learn the visual vocabulary
and the projections of PCA dimensionality reduction. When
evaluating our methods on Oxford5k, we always use the
data learned on Paris6k.

Oxford105k [25]: This dataset is the combination of the
Oxford5k dataset and 99782 negative images crawled from
Flickr using the 145 most popular tags. This dataset is used
to evaluate the search performance (reported as mAP) on
a large scale. Paris6k is used as a training dataset for
Oxford105k.

Holidays [9]: This dataset is a selection of personal
holiday photos (1491 images) from INRIA, including a large
variety of scene types (natural, man-made, water and fire
effects, etc.). A sample of 500 images from the whole dataset
is selected for query purposes [9]. The performance is
reported as mAP, as for Oxford5k and Oxford105k, after
excluding the query image from the results. As a training
dataset for vocabulary construction and image representation
level PCA learning we use the Paris6k dataset in all
experiments.

4. Sources of multiple codebooks 

We propose combining multiple vocabularies that
differ not just in the random initialization of the clustering
procedure, but also in the data used for clustering. The
feature data are altered in the process of local feature
description. This process does not try to synthesize
appearance deformations, but rather varies certain design
choices in the pipeline of feature description, such as the
relative size of the measurement region. Vocabularies created
in this manner will contain less redundancy. This is combined
with joint PCA dimensionality reduction (as described in
Sections 2.2 and 2.3) in order to produce short-vector image
representations that are used for searching the most similar
images in the dataset.

Quantization complexity for all vocabularies used in the
experiments is given in Table 1. As stated in [8], the time
necessary to quantize 2000 local descriptors of a query image,
for four k = 8k vocabularies, on 12 cores is 0.45s, using a
multi-threaded exhaustive search implementation. Timings
are proportional to the vocabulary size, i.e., to the number
in the right column of Table 1.
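
The complexity values of the multi-size vocabulary bundles are simply sums of powers of two, which the following sketch reproduces. The timing helper is only a linear extrapolation from the single reported measurement (0.45 s for four k = 8k vocabularies) under the stated proportionality; it is an illustration, not a measured result, and both function names are hypothetical.

```python
def bundle_complexity(largest, smallest=128):
    """Sum of vocabulary sizes largest + largest/2 + ... + smallest."""
    total, k = 0, largest
    while k >= smallest:
        total += k
        k //= 2
    return total

REFERENCE_TIME_S = 0.45           # reported for 2000 descriptors on 12 cores
REFERENCE_COMPLEXITY = 4 * 8192   # four k = 8k vocabularies

def estimated_quantization_time(complexity):
    """Linear extrapolation, assuming time is strictly proportional to complexity."""
    return REFERENCE_TIME_S * complexity / REFERENCE_COMPLEXITY

for name, largest in [("4k+2k+...+128", 4096), ("2k+1k+...+128", 2048),
                      ("1k+512+256+128", 1024), ("512+256+128", 512)]:
    print(name, bundle_complexity(largest))
```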

Multiple measurement regions. An affine invariant
descriptor of an affine covariant region can be extracted from
any affine covariant constructed measurement region [17].

[Figure 1 plots. Left panels: "BOW baseline (nk dimensions)"; right panels: "Joint PCA of multiple vocabularies (128 dimensions)"; x-axis: number of concatenated vocabularies.]

Figure 1. Baseline methods: Left plots show mAP performance on Oxford5k (upper plot) and Holidays (lower plot) after straightforward
concatenation of BOW vectors (no PCA dimensionality reduction performed) generated using multiple vocabularies. Note that the
dimensionality of BOW grows linearly with every new concatenation. Right plots present mAP performance on Oxford5k and Holidays after joint
PCA dimensionality reduction of concatenated BOW representations to a $D' = 128$ dimensional vector.


Table 1. Complexity of vocabularies used throughout the
experiments: Complexity is given as the number of vector
comparisons per local descriptor during the construction of the
final BOW image representation.

Vocabulary        Complexity
8k                8192
4k                4096
2k                2048
1k                1024
4k+2k+...+128     8064
2k+1k+...+128     3968
1k+512+256+128    1920
512+256+128       896

An example of a measurement region that is, in general,
of a different shape than the detected region is an
ellipse fitted to the region, as proposed by [2 ] and also
used for MSERs [17]. An important parameter is the
relative scale of the measurement region with respect to the
scale of the detected region. Since the output of the
detector is designed to be repeatable, it is usually not
discriminative. To increase the discriminability of the
descriptor, it is commonly extracted from an area larger than
the detected region. In the case of [23], the relative change
in the radius is $r = 3\sqrt{3}$. The larger the region, the
higher the discriminability of the descriptor, as long as the
measurement region covers a close-to-planar surface. On the
other hand, larger image patches have a higher chance of
hitting depth discontinuities and thus being corrupted. An
example of multiple measurement regions is shown in Figure 2.
To make the best of this trade-off, we propose to construct
multiple vocabularies over descriptors extracted at multiple
relative scales of the measurement regions. Including lower
scales mitigates the disadvantages of large measurement
regions, while joint dimensionality reduction eliminates the
dependencies between the representations.

[Figure 2 plots. Two rows of measurement regions labelled $0.5 \times r$, $0.75 \times r$, $1 \times r$, $1.25 \times r$, $1.5 \times r$.]

Figure 2. Multiple measurement regions (mMeasReg): A corresponding feature is detected in two images (left). Multiple measurement
regions for a single detected feature are shown in each row. The normalized patches (right) show different image content described by the
respective descriptor.

We consider using different sizes of measurement
regions: $0.5 \times r$, $0.75 \times r$, $1 \times r$,
$1.25 \times r$, $1.5 \times r$; each creates slightly
different SIFT descriptors that are used to learn a separate
vocabulary. The implementation is very simple, and during the
online stage the computation has to be done only for the
features from the query image region. Though simple, this
method provides significant improvement even when
concatenating vocabularies of small sizes (i.e., k = 2k and
k = 1k), see Figure 3 (left plot). We also explore the use of
vocabularies with different sizes. All BOW vectors in this
case are weighted proportionally to the logarithm of their
vocabulary size [8]. In each step we concatenate a new bundle
of vocabularies with multiple sizes, calculated with a
different measurement region. We notice improvement when
using multiple vocabulary sizes as well, see Figure 3 (right
plot). For the presentation of results on both plots in
Figure 3, in every step we add a different vocabulary created
on SIFT vectors with measurement regions in a predefined
order: $0.5 \times r$, $0.75 \times r$, $1 \times r$,
$1.25 \times r$, $1.5 \times r$. This approach
is denoted as mMeasReg.
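
The log-weighted concatenation of BOW vectors from vocabularies of different sizes can be sketched as follows. The text only states that the weights are proportional to the logarithm of the vocabulary size, so the constant of proportionality and the point in the pipeline at which the weights are applied are assumptions of this sketch, and `weighted_concat` is a hypothetical name.

```python
import numpy as np

def weighted_concat(bows):
    """Concatenate per-vocabulary BOW vectors, each scaled by log(vocabulary size).

    bows: list of 1-D arrays; the length of each array is taken to be the
    size k of the vocabulary that produced it.
    """
    parts = [np.log(len(b)) * np.asarray(b, dtype=np.float64) for b in bows]
    return np.concatenate(parts)
```

For a bundle such as 2k+1k+...+128 the resulting vector has 3968 dimensions before the joint PCA reduction.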

Multiple power-law normalized SIFT descriptors. 

SIFT descriptors [16] were the popular choice in most
image retrieval systems for a long time. Arandjelovic
et al. [1] show that using a Hellinger kernel instead of the
standard Euclidean distance to measure the similarity between
SIFT descriptors leads to a noticeable performance boost
in a retrieval system. The kernel is implemented by simply
square rooting every component of the SIFT descriptor. Using
the Euclidean distance on these new RootSIFT descriptors
gives the same result as using the Hellinger kernel on the
original SIFT descriptors. In general, a power-law
normalization [24] with any power $0 < \beta < 1$ can be
applied to the descriptors ($\beta = 0.5$ resulting in
RootSIFT [1]). Voronoi cells constructed in power-law
normalized descriptor spaces can be seen as non-linear
hyper-surfaces separating the features in the original (SIFT)
descriptor space. Concatenation of such feature space
partitionings reduces the redundant information.
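
The relation between RootSIFT and the Hellinger kernel can be checked numerically. The sketch below assumes non-negative descriptors that are L1-normalized before the power is applied, as in the RootSIFT formulation; the function names are hypothetical.

```python
import numpy as np

def power_sift(desc, beta=0.5):
    """L1-normalize a non-negative descriptor, then raise each component to beta."""
    desc = np.asarray(desc, dtype=np.float64)
    desc = desc / desc.sum()
    return desc ** beta

def hellinger_kernel(x, y):
    """Hellinger kernel between two non-negative descriptors after L1 normalization."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return np.sqrt((x / x.sum()) * (y / y.sum())).sum()

def squared_euclidean(a, b):
    return float(((a - b) ** 2).sum())
```

With $\beta = 0.5$ the squared Euclidean distance between two transformed descriptors equals $2 - 2H(x, y)$, where $H$ is the Hellinger kernel, so ranking by one is equivalent to ranking by the other. For other exponents such as 0.4 or 0.6 the same function applies, but this identity no longer holds.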

No additional memory is required, and the change
can be done on-the-fly with virtually no additional
computational cost using a simple power operation. We consider
building four different vocabularies using: SIFT and SIFT
with every component raised to the power of 0.4, 0.5, 0.6
(denoted as SIFT$^{0.4}$, SIFT$^{0.5}$, SIFT$^{0.6}$,
respectively). Concatenation is done on single vocabularies
(Figure 4, left plot) and on a bundle of vocabularies with
different sizes (Figure 4, right plot). Adding all SIFT
modifications to the process of vocabulary creation achieves a
noticeable improvement of retrieval performance for all
vocabulary sizes. We denote this method as mRootSIFT.

Combining vocabularies of different SIFT exponents
improves over combining different vocabularies of a single
SIFT exponent. For example, for 4 x 2k vocabularies, the
mAP on Oxford5k is 46.5 for 4 x SIFT$^{0.5}$, and 47.7
(Figure 4, left) for the exponent combination.

Multiple linear projections of SIFT descriptors. In
locality sensitive hashing, (random) linear projections are
commonly used to reduce the dimensionality of the space
while preserving locality. The idea pursued in this part
of the paper is to use linear projections on the feature
descriptors (SIFTs) before the vocabulary construction via
k-means. However, random projections do not reflect the
structure of the descriptors, resulting in noisy descriptor
space partitionings. We propose to use PCA learned linear
projections of SIFTs, learned on different training sets
or subsets. The projections learned this way account for
the statistics given by the training sets and hence produce
meaningful distances, while inserting different biases into
the vocabulary construction.

[Figure 3 plots. Panel titles: "mMeasReg" (left), "mMeasReg with different vocabulary sizes" (right).]

Figure 3. Multiple measurement regions (mMeasReg): mAP performance improvement on Oxford5k after PCA reduction to $D' = 128$
of concatenated BOW vectors produced on vocabularies created using SIFT descriptors with different measurement regions:
$0.5 \times r$, $0.75 \times r$, $1 \times r$, $1.25 \times r$, $1.5 \times r$.

[Figure 4 plots. Panel titles: "mRootSIFT" (left), "mRootSIFT with different vocabulary sizes" (right).]

Figure 4. Multiple power-law normalized SIFT descriptors (mRootSIFT): mAP performance improvement on Oxford5k after PCA
reduction to $D' = 128$ of concatenated BOW vectors produced on vocabularies created using multiple local feature descriptors: SIFT,
SIFT$^{0.4}$, SIFT$^{0.5}$, SIFT$^{0.6}$.

The improvement is twofold: (i) increased performance
measured by mAP, and (ii) shorter quantization time during
query due to shorter local descriptors after dimensionality
reduction. On the other hand, a small amount of storage
is required to save the learned projection matrices for every
vocabulary, which we reuse at query time. We consider and
evaluate three different approaches for learning the
eigenvectors used to project SIFT vectors from $D$ to $D'$
dimensions:

1. We learn eigenvectors on the Paris6k dataset and
reduce the dimension of SIFT descriptors to $D' =
80, 64, 48, 32$ in the respective order for every newly
created vocabulary (mPCA$_1$-SIFT). Results of this
experiment are shown in Figure 5, 1st row.

2. We learn eigenvectors on different datasets: Paris6k,
Holidays, University of Kentucky benchmark (UKB),
PASCAL VOC'07 training in the respective order for
every newly created vocabulary (mPCA$_2$-SIFT). The
dimension of SIFT descriptors is reduced to $D' = 80$
in all cases. For the mAP performance on Oxford5k,
see Figure 5, 2nd row.

3. We learn eigenvectors on different datasets: Paris6k,
Holidays, UKB, PASCAL VOC'07 training and
reduce the dimension of SIFT descriptors differently for
each dataset ($D' = 80, 64, 48, 32$, respectively),
creating different vocabularies (mPCA$_3$-SIFT).
Performance is presented in Figure 5, 3rd row.

Note that the first vocabulary in all three approaches
is produced using standard SIFT descriptors without PCA
reduction. A new vocabulary is added in every step of the
experiment, ending with the joint dimensionality reduction of
5 concatenated BOW vectors.
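
Learning a PCA projection of local descriptors and applying it before k-means can be sketched as below. The training descriptors are assumed to be given as rows of a NumPy array; the function names are hypothetical, and the random data in the usage lines only stands in for real SIFT descriptors.

```python
import numpy as np

def learn_pca_projection(train_desc, out_dim):
    """train_desc: (n, d) local descriptors. Returns (mean, W), W of shape (d, out_dim)."""
    mean = train_desc.mean(axis=0)
    cov = np.cov(train_desc - mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)           # ascending eigenvalues
    top = np.argsort(eigval)[::-1][:out_dim]       # indices of the largest ones
    return mean, eigvec[:, top]

def project(desc, mean, W):
    """Reduce descriptors (rows) to out_dim dimensions."""
    return (desc - mean) @ W

# One projection per additional vocabulary, with decreasing output dimension.
rng = np.random.default_rng(0)
train = rng.random((1000, 128))                    # placeholder for SIFT descriptors
projections = {d: learn_pca_projection(train, d) for d in (80, 64, 48, 32)}
```

Only the mean vector and the projection matrix of each vocabulary need to be stored for use at query time, which is the small storage overhead mentioned above.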


Multiple feature detectors. In the Video Google
approach [27] the authors combine vocabularies created from
two different feature types. In this paper we attempt to
combine Hessian-Affine [23] and MSER [17] detectors.
Straightforward concatenation of BOW vectors created on
k = 8k vocabularies (48.7 mAP on Oxford5k) gives an
improvement over using single BOW representations
with Hessian-Affine (44.7) and MSER (40.1) features. After
joint PCA reduction, however, combining the features
(37.0 mAP on Oxford5k) decreases performance compared
to PCA reduction on a single Hessian-Affine vocabulary
(38.6), while it increases performance compared to
PCA-reduced BOW vectors built on a single MSER
vocabulary (24.4). Similar conclusions are made when
combining smaller vocabulary sizes, i.e., there is always a
drop in performance when comparing PCA reduction on a single
vocabulary with Hessian-Affine features and PCA on combined
vocabularies with Hessian-Affine and MSER features; the mAP
drops from 39.8 to 39.1, from 40.7 to 38.7, and from 36.8 to
35.1 for k = 4k, 2k, 1k, respectively. We also experimented
with combining Harris-Affine [T ] with Hessian-Affine features
in the same manner as with MSER, but the improvement is not
significant. PCA reduction of a single k = 8k vocabulary on
Hessian-Affine yields 38.6 mAP on Oxford5k, while joint PCA
after adding a vocabulary of the same size built on
Harris-Affine improves mAP to 39.0, which is a smaller
improvement than using two vocabularies built on
Hessian-Affine features with different randomization
(40.0 mAP).

[Figure 5 plots. Panel titles: "mPCA-SIFT" (left) and "mPCA-SIFT with different vocabulary sizes" (right), one row per variant; x-axis: number of concatenated vocabularies.]

Figure 5. Multiple linear projections of SIFT descriptors (mPCA-SIFT): mAP performance improvement on Oxford5k after PCA
reduction to $D' = 128$ of concatenated BOW vectors produced on vocabularies created using different PCA-reduced SIFT descriptors.
For more details about all three presented methods see Section 4.


Discussion. In order to better understand the impact of
using multiple vocabularies we count the number of unique
assignments in the product vocabulary. It corresponds to
the number of non-empty cells of the descriptor space
generated by all vocabularies simultaneously. The maximum
possible number of unique assignments is equal to the
product of the numbers of clusters (cells) of all joint
vocabularies. The number is related to the precision of
reconstruction of each feature descriptor from its visual
word assignments. For the combination of vocabularies with
different SIFT exponents (mRootSIFT) the number of unique
assignments for the Oxford5k dataset is shown in Figure 6.
The plots are similar for all vocabulary combinations.
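
Counting unique assignments in the product vocabulary amounts to counting distinct tuples of visual word indices. A minimal sketch, assuming each local feature is represented by a tuple holding its visual word index in every vocabulary (the function names are hypothetical):

```python
def count_unique_assignments(assignments):
    """Number of non-empty cells of the product vocabulary.

    assignments: iterable of tuples, one per local feature, where entry j is
    the visual word index of that feature in vocabulary j.
    """
    return len({tuple(a) for a in assignments})

def max_unique_assignments(vocabulary_sizes):
    """Upper bound: product of the numbers of cells of all vocabularies."""
    total = 1
    for k in vocabulary_sizes:
        total *= k
    return total
```

In practice the count is also bounded by the number of local features in the dataset, so it saturates far below the product of the vocabulary sizes.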

4.1. Comparison with the state-of-the-art 

A comparison with the current methods dealing with short
vector image representation is given in Table 2. The authors
of the baseline approach on multiple vocabularies (mVocab)
did not provide results for the Oxford5k and Oxford105k
datasets using all of their proposed methods, so we
reimplemented them and present the corresponding results.
Compared to their best method on Oxford5k, which achieves 42.9
mAP, our best method (48.8 mAP) obtains a significant
relative improvement of 13.8%. In fact, all our methods
outperform the mVocab baseline methods on Oxford5k by a
noticeable margin, with an improvement of 6.1% in the case
of our worst performing method. When evaluating large-scale
retrieval on the Oxford105k dataset our methods again
outperform the baseline method; the relative improvement is
17.9% for our best performing method, and 7.7% for the
worst performing one. In order to make a fair comparison
when evaluating on the Holidays dataset we again
reimplemented the baseline approach, using Paris6k for
learning the vocabularies and PCA projections (as we did in
all our methods). In this case, the relative improvement is
4.3% with our best method (from 64.5 mAP to 67.3 mAP). We
also compare our methods to two recent state-of-the-art
approaches on short representations [2, 15]. On Oxford5k
and Oxford105k we improve by as much as 8.9% and 10.7%,
respectively, compared to the VLAD based approach [2], and
12.7% and 17.3%, respectively, compared to the T-embedding
based approach [15]. On the Holidays dataset the relative
improvement is 7.7% compared to the former and 9.1% compared
to the latter. Note that the dataset used for learning of the
meta-data for Holidays is different: we use Paris6k, while
both [2] and [15] use an independent dataset comprising
60k images downloaded from Flickr.

Table 2. Comparison with the state-of-the-art on short vector image representation ($D' = 128$): Results in the first section of the
table are mostly obtained from the paper [14], except for the recent method on triangulation embedding and democratic aggregation with
rotation and normalization ($\phi_\Delta$+$\psi_d$+RN) proposed in [15]. In the second section we present results from methods that use joint PCA
and whitening of high dimensional vectors as we do. Results marked with * are obtained after our reimplementation of the methods using
the feature detector and descriptor described in Section 2.1 and Paris6k as a learning dataset. In the last section of the table we present
results of our methods. All methods are described in detail in Section 4.

Method                          Vocabulary           Oxford5k     Oxford105k   Holidays
---------------------------------------------------------------------------------------
GIST [22]                       N/A                  -            -            36.5
BOW [27]                        k=20k                19.4         -            45.2
Improved Fisher [24]            k=64                 30.1         -            56.5
VLAD [13]                       k=64                 -            -            51.0
VLAD+SSR [] ]                   k=64                 28.7         -            55.7
$\phi_\Delta$+$\psi_d$+RN [15]  k=16                 43.3         35.3         61.7
---------------------------------------------------------------------------------------
mVocab/BOW [8]                  k=4x8k               41.3/41.4*   -/33.2*      56.7/63.0*
mVocab/BOW [8]                  k=2x(32k+...+128)    -/42.9*      -/35.1*      60.0/64.5*
mVocab/VLAD [8]                 k=4x256              -            -            61.4
mVocab/VLAD+adapt+innorm [2]    k=4x256              44.8         37.4         62.5
---------------------------------------------------------------------------------------
mMeasReg/mVocab/BOW             k=5x2k               46.9         38.9         66.9
mMeasReg/mVocab/BOW             k=4x(4k+...+128)     47.7         39.2         67.3
mRootSIFT/mVocab/BOW            k=4x2k               47.7         39.8         64.3
mRootSIFT/mVocab/BOW            k=4x(2k+...+128)     48.8         41.4         65.6
mPCA$_3$-SIFT/mVocab/BOW        k=5x2k               45.8         38.1         63.2
mPCA$_1$-SIFT/mVocab/BOW        k=5x(4k+...+128)     45.5         37.8         64.6
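
The relative improvements quoted in this section can be re-derived from the mAP values reported in Table 2. The short check below uses only numbers that appear in the table; the function name is hypothetical.

```python
def relative_improvement(ours, baseline):
    """Relative improvement in percent of `ours` over `baseline`."""
    return 100.0 * (ours - baseline) / baseline

# (our mAP, compared mAP) pairs taken from Table 2.
comparisons = {
    "Oxford5k, best vs. mVocab": (48.8, 42.9),
    "Oxford5k, worst vs. mVocab": (45.5, 42.9),
    "Oxford105k, best vs. mVocab": (41.4, 35.1),
    "Oxford105k, worst vs. mVocab": (37.8, 35.1),
    "Holidays, best vs. mVocab": (67.3, 64.5),
    "Oxford5k, best vs. [2]": (48.8, 44.8),
    "Oxford105k, best vs. [2]": (41.4, 37.4),
    "Oxford5k, best vs. [15]": (48.8, 43.3),
    "Oxford105k, best vs. [15]": (41.4, 35.3),
    "Holidays, best vs. [2]": (67.3, 62.5),
    "Holidays, best vs. [15]": (67.3, 61.7),
}

for name, (ours, baseline) in comparisons.items():
    print(f"{name}: {relative_improvement(ours, baseline):.1f}%")
```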

5. Conclusions 

Methods for multiple vocabulary construction were studied
and evaluated in this paper. Following [8], the concatenated
BOW image representations from multiple vocabularies
were subject to joint dimensionality reduction to 128D
descriptors. We have experimentally shown that generating
diverse multiple vocabularies has a crucial impact on search
performance. Each of the multiple vocabularies was learned
on local feature descriptors obtained with varying parameter
settings. That includes feature descriptors extracted from
measurement regions of different scales, different
power-law normalizations of the SIFT descriptors, and applying
different linear projections to feature descriptors prior to
k-means quantization. The proposed vocabulary constructions
improve performance over the baseline method [8],
where only different initializations were used to produce
multiple vocabularies. More importantly, all proposed
methods exceed the state-of-the-art results [2, 15] by a large
margin. The choice of the optimal combination of
vocabularies still remains an open problem.



Figure 6. Number of unique assignments (vocabulary cells) for
the Oxford5k dataset when combining vocabularies built on
multiple power-law normalized SIFT descriptors (mRootSIFT):
SIFT, SIFT$^{0.4}$, SIFT$^{0.5}$, SIFT$^{0.6}$.